Online Radicalization (also called Cyber-Terrorism or Extremism or Cyber-Racism or Cyber-Hate)
is widespread and has become a major and growing concern to society, governments
and law enforcement agencies around the world. Research shows that various platforms on the
Internet, such as YouTube (a popular video sharing website), Twitter (an online micro-blogging
service), Facebook (a popular social networking website), online discussion forums and the
blogosphere, are being misused for malicious intent. These platforms have a low barrier to
publishing content, allow anonymity, provide exposure to millions of users and offer the
potential for very quick and widespread diffusion of a message. Such platforms are being used
to form hate groups and racist communities, spread extremist agendas, incite anger or violence,
promote radicalization, recruit members and create virtual organizations and communities.
Automatic detection of online radicalization is a technically challenging problem because of
the vast amount of data, unstructured and noisy user-generated content, dynamically changing
content and adversarial behavior. Several solutions have been proposed in the literature
to combat and counter cyber-hate and cyber-extremism. In this survey, we
review solutions to detect and analyze online radicalization. We review 40 papers published at
12 venues from June 2003 to November 2011. We present a novel classification scheme to classify
these papers. We analyze these techniques, perform trend analysis, discuss limitations of existing
techniques and identify research gaps.

Categories and Subject Descriptors: A.1 [General Literature]: INTRODUCTORY AND SURVEY

General Terms: Algorithms 

Additional Key Words and Phrases: Security, Social Networks, Cyber Extremism, Online Radicalization

1. INTRODUCTION

The Internet provides its users with anonymity, a low publication barrier and a low cost
of publishing and managing content. The advent of Web 2.0 and the meteoric rise of social
media have enabled Internet users to freely disseminate ideas and opinions using
multiple modalities like blogs, social networking websites, forums and video sharing
websites. These characteristics are exploited by malicious online users and groups
for malignant activities like hate propaganda, potential recruitment, fundraising and
brainwashing [Glaser et al. 2002]. The Internet is being used for various unlawful
activities, including as a place to practice racism and xenophobia [Burris 2000].

1.1 Online Radicalization 

Online radicalization (also called cyber extremism and cyber hate propaganda) is
a growing concern to society and of great pertinence to governments and
law enforcement agencies. The ease of publishing and assimilating content on the
Internet via social media and video sharing websites, amongst others, coupled with
high information diffusion rates, has led to faster content dissemination and larger
audience reach. This has brought together researchers around the world from various
disciplines like psychology, social sciences and computer science to understand
the problem of online radicalization and develop tools and techniques to counter it.
It has also spawned an interdisciplinary research topic, Intelligence and Security
Informatics (ISI), and a dedicated venue, the IEEE International Conference on
Intelligence and Security Informatics¹ (started in 2003), where leading ISI researchers
around the world meet to discuss challenges and future trends. Intelligence and
security informatics (ISI) is defined as an interdisciplinary research area concerned
with the study of the development and use of advanced information technologies and
systems for national, international, and societal security-related applications [Chau
et al. 2011].

Automatically detecting and analyzing online radicalization is one of the important
themes of research in the domain of Intelligence and Security Informatics. Due
to the inherent characteristics of the Internet, hate groups have been increasing their
online presence over the years. Hate users and groups exist on various computer
communication media such as the WWW, blogs and newsgroups (Yahoo and MSN groups).
Such users also use other media like hosting services, podcasts and games to
spread xenophobic information. Various live services like Internet Relay Chat (IRC)
and Internet radio broadcasts are available to connect with individuals to share and
promote propaganda [Franklin 2010]. The increasing presence and enormous power
of social media has appealed to extremist organizations and individuals. There is a
significant surge every year in the number of hate promoting groups on social
networking websites like Facebook and Twitter [Simon-Wiesenthal-Center 2011]. This
makes the detection, analysis and monitoring of radical content on the Internet
a significant step in realizing a safer web. Various projects, like the Dark Web project
funded by the National Science Foundation (NSF) and the Princip project funded
by the Safer Internet Action Plan of the European Commission, have sprung up in
the last decade with the goal of a safe Internet devoid of racism and xenophobia
[Briot 2002] [Qin et al. 2005]. Law enforcement agencies and governments across
the world have realized the importance of detecting and analyzing the presence of
online radicalization.

1.2 Countering Online Radicalization : Technical Challenges 

While detecting radical online content is an important problem, there are various
issues faced by security analysts working at law enforcement agencies. Such content
may be extremely covert, avoiding indexing by traditional web crawlers. Much of
this content also tends to be fleeting and comprises various data formats including
text, image and video. The volume of content on the Internet and the meteoric
rise of social media have further aggravated the challenge of discovering such
content. For example, Bob is a security analyst who works for a law enforcement
agency in country X. A typical working day for Bob involves finding online
domestic terrorists, radical communities and content on YouTube (specifically, for his



¹http://www.isiconference.org/
country X). YouTube, a popular video sharing website, attracts 100 million active
users per week and 3 billion video views per day.² Bob, although a well qualified
security analyst, is an elementary computer user who performs such tasks by querying
the web and manually browsing through content. Bob finds this an arduous
task, and finds it particularly difficult to observe common traits and trends amongst
the voluminous content. Bob also encounters problems in visualizing the data and
the relationships between users who create and share such content. Figure 1 shows
the process undertaken by security analysts like Bob working for law enforcement
agencies around the world. The principal requirements of security analysts or law
enforcement agents are the following³:

(1) Radical online content 

(2) Implicit and explicit virtual communities 

(3) Authoritative and instrumental users 

Security analysts working for various law enforcement agencies around the world
face the same situation as Bob. In addition to law enforcement agencies, various
benevolent organizations like the Project for Research of Islamist Movements
(PRISM) and the Search for International Terrorist Entities (SITE Institute)
manually search the Internet and analyze extremist content. However, keyword-based
approaches may result in many false positives or off-topic content [Gerstenfeld
et al. 2003]. Moreover, the unstructured and informal nature of the content
(abbreviations, colloquialisms, transliterations) adds further to the complexity of the problem.
Hence, manual search for detecting, collecting and analyzing such content is not
only time-consuming but also infeasible.

In order to assist security analysts like Bob, a number of techniques have been
proposed in the literature to automatically detect and analyze radical content on the
Internet. The various novel techniques proposed bring multiple paradigms and
perspectives to the table. This makes the task of selecting and reviewing these
techniques challenging but stimulating. The first symposium on Intelligence and
Security Informatics was held by the National Science Foundation (NSF) / National
Institute of Justice (NIJ) in 2003 at Tucson, Arizona.⁴ Ever since, detecting and
analyzing online radicalization has garnered interest from researchers across various
quarters around the world. Over the past 9 years, many techniques have been
proposed to tackle the problem of online radicalization detection and analysis.

1.3 Survey : Problem Definition 

We perform an exhaustive survey of the literature addressing the problem of automated
online radicalization detection and analysis. To the best of our knowledge, this
is the first survey on Automated Solutions to Detect and Analyze Online Radicalization.
The solutions we survey would help security analysts to identify three
attributes: radical users, content and communities. We characterize the literature
space into multiple taxonomies where each taxonomy cuts through a particular facet
and has a set of associated features. These facets are carefully chosen to provide



²http://www.youtube.com/t/press_statistics

³Based on inputs from senior officers from law enforcement agencies
⁴http://www.isiconference.org/2003/index.htm
Fig. 1. A typical process followed by security analysts at law enforcement agencies

multiple perspectives on online radicalization detection approaches. We envisage
these facets to be useful for both researchers and law enforcement agencies.
Researchers can refer to this survey to understand the various current approaches, while
law enforcement agencies can refer to this survey to understand the state-of-the-art
techniques. This survey would assist security analysts in building tools to detect
and analyze online radicalization, keeping in mind the limitations, challenges
and issues with these techniques.

1.4 Contributions 

The main contributions of this paper are as follows : 

(1) This is the first survey in the literature space of Automated Solutions to 
Detect and Analyze Online Radicalization (also called cyber extremism, hate 
propaganda, cyber racism) 

(2) We propose a novel classification scheme across multiple meaningful dimensions 
to help both researchers and law enforcement agencies 

(3) We compare and contrast techniques used to detect and analyze online
radicalization. We analyze these techniques to inspect trends and point out research
gaps.

The rest of the paper is organized as follows: Section 2 explains the literature
review process followed in selecting and analyzing papers. Section 3 discusses the
facets for paper classification. Sections 4 and 5 classify the selected articles
into different facets and briefly outline the approaches. Section 6 analyzes the
techniques used, trends observed and limitations of the approaches proposed in the
literature. Section 7 discusses research gaps. Section 8 gives concluding remarks.

2. LITERATURE REVIEW 

The objectives of the work presented in this study are as follows:

(a) To investigate the different techniques used to detect and analyze radical content
on the Internet

(b) To perform an in-depth study of these techniques to analyze common trends

(c) To examine limitations in these techniques and point out research gaps

2.1 Review Process 

In order to meet the above objectives, we collected relevant literature by following 
these steps. Figure 2 shows the literature review process followed in the survey. 

(1) Collection - The first step involved collecting all papers in the IEEE
Intelligence and Security Informatics (ISI) Proceedings⁵

(2) Relevant Paper Selection - We then selected the papers whose main
contributions were to develop automated techniques for online radical content
detection or to analyze online radical content

(3) Citation Snowball - We then performed citation analysis by browsing through
the references of these papers and repeated Steps (2) and (3) till all relevant
literature was exhausted

(4) Facet Listing - In this step, we listed the various facets according to which
the papers would be classified. The authors of this paper individually performed
this task to come up with a paper classification scheme. The authors then sat
together and finalized the facet list, ironing out disagreements in the process.
This ensures rigor in forming the classification scheme and reduces potential
bias.

(5) Paper Classification - This step classified each paper into its appropriate
facet bucket. As in the previous step, the authors performed the classification
individually. The authors then discussed disagreements on the paper
classification buckets till convergence was achieved.

(6) Solution Analysis - Finally, we present an analysis of the selected literature
across various facets, discuss observed trends and limitations, and point out
research gaps.

2.2 Related Problems and Scope 

There are several closely related research problems to automated detection and 
analysis of online radicalization in the field of ISI. Some of them include Deception 
Detection, Infrastructure Protection & Cyber Security and Criminal Data Mining 
& Network Analysis. 

* Deception detection is the task of detecting untruthful and subterfuge content
[Chen et al. 2004]. Deceptive content generated with the objective of propaganda,
brainwashing etc. may mislead Internet users. Detecting online radicalization,
on the other hand, is concerned with discovering racist, hate promoting and
xenophobic online content.

* Infrastructure protection & Cyber security deals with the protection of critical
cyber infrastructure against malicious attacks [Yunos and Hafidz Suid 2010].
These attacks may not necessarily be generated by radical users.



⁵http://www.isiconference.org/
Fig. 2. Literature review process followed in our survey

* Criminal data mining & Network analysis is concerned with mining and analyzing
criminal network data to reveal hidden patterns and insights [Derosa 2004]. This
data is generally recorded by law enforcement agencies as a formal process in
their dealings with daily crime.

In contrast to the above problems, we focus on techniques which are used to detect
and analyze radical content available on the Internet. These solutions are devised
to assist security analysts working for law enforcement agencies in understanding
unlawful usage of the Internet, in particular for radicalization. These solutions also
cover analysis and summarization of online radical content, users and their online
relationships. Hence, papers with topics like Deception Detection, Infrastructure
Protection & Cyber Security and Criminal Data Mining & Network Analysis as
their main goal are not considered and are beyond the scope of this survey.

2.3 Statistics 

We have rigorously chosen a collection of 40 papers published at 7 conference venues
and 5 journals from June 2003 to November 2011. These papers were selected because they
list their main contribution as either a solution for automatically detecting
online radicalization or an analysis of online radical content. Table I shows
the distribution of papers across conference venues and journals.

3. FACETS FOR ARTICLE CLASSIFICATION 

The goal of this survey is to provide a research map of the literature on online
radicalization detection and analysis techniques. This would help both researchers as
well as security analysts at law enforcement agencies to understand the structure
Table I. Distribution of papers across conferences and journals

Venue type     Number of Papers
Conference     28
Journals       12
Total          40

of the research. In order to achieve this goal, we identify multiple facets from exist- 
ing papers which would help characterize the solutions. These facets also provide 
key insights into trends, limitations and advantages of the techniques. 

In order to provide a better structure to the research map, we divide our survey
into two parts and classify articles accordingly: Techniques for Online Radicalization
Detection and Automated Techniques for Online Radicalization Analysis. We
define the two problems as follows:

* Online Radicalization Detection is the problem of detecting radical (hate
promoting or extremist) content on the Internet. Detecting online radicalization is
a genre-centric information retrieval problem. Typically the input is the WWW (or
a part of it) and the output is the detected extremist content.

* Online Radicalization Analysis, on the other hand, pertains to analyzing extremist
content to gain deeper insight into the functioning of extremist groups on the
Internet and their usage of it. Typically the input to an analysis algorithm is
a set of documents containing radical content and the output is a detailed analysis
explaining the behavioral, structural and linguistic characteristics of the content.

We performed a meticulous analysis of the literature space and concluded that this
structure would help us draw better insights and analyze trends. In addition, it
provides us the flexibility to choose different facets for two closely related problems,
resulting in a better organization of the literature. A few articles in the literature
contributed to both automatic detection and analysis of online radicalization. These
articles are individually surveyed with respect to both contributions and hence
included in both parts of the survey. They are also characterized according to the two
sets of facets in both parts of the survey.

Table II. Distribution of papers surveyed which address the problem of online radicalization
detection or analysis

Problem                            Number of Papers
Online Radicalization Detection    18
Online Radicalization Analysis     22
Total                              40



We perform a rigorous evaluation of the literature in the area. We then identify 
properties in each article to characterize the proposed techniques. These properties 
are analyzed to form an initial list of facets. These facets are then discussed and 
refined to form a final list. Each facet was chosen to succinctly position all articles 
in order to help researchers and security analysts locate techniques as per their
requirements. 

3.1 Online Radicalization Detection : Facets 

In this part of the survey, we identify literature which contributes towards automatic
online radicalization detection either as its main goal or as a part of its goals. We
search for literature whose main aim is to search, collect and detect extremist
content on the Internet. For example, we collect papers whose primary objective
is to find extremist websites or hate promoting online forums. Concretely, the
algorithms of papers in this section of the survey receive the WWW as input and
output the extremist content found on the WWW. We then proceed to
analyze each paper and create a list of facets to create a structured research map.
Each facet has a list of properties associated with it. The facets are as follows:

* Technique : What techniques are used to detect radical content on the web? 

* Features : What features do these techniques use? 

* Modalities : Which modalities on the web do they consider? 

* Data Type : What data types do these techniques cater to? 

* Output : How are the results presented?

* Evaluation : How are these techniques evaluated? 

* Language : What languages do these techniques address? 

* Genre : What genre of hate is being detected? 

These facets and their properties, along with brief descriptions, are listed in Table
IV.

3.1.1 Technique. The technique is the key differentiator between proposed solutions
for online radicalization detection. The most common technique used to detect radical
content on the web is a Link Based Bootstrapping (LBB) algorithm. Other
techniques like machine learning and multi-agent methods have also been explored
for detecting racist, hate promoting content.

Web mining is the process of applying data mining techniques to unearth information
from web documents [Etzioni 1996]. Web mining can be further categorized
into three parts: Web content mining, Web structure mining and Web usage mining
[Kosala and Blockeel 2000]. Web content mining entails the process of discovering
relevant useful information from web documents including text, audio, video, metadata
and hyperlinks. Web structure mining refers to the discovery of the structure
of the web by exploiting the hyperlinks between web documents [Chakrabarti et al.
1999]. Web usage mining is concerned with the analysis of patterns in web usage data,
which includes search logs, clickthrough data, server logs etc. [Cooley et al. 1997].
The above three categories are not necessarily mutually exclusive, and hybrid
approaches have been successfully used to discover relevant information from web
documents [Kosala and Blockeel 2000]. Major solutions proposed to detect online
radicalization employ a combination of web structure mining and web content
mining [Qin et al. 2005]. The Link Based Bootstrapping approach is a type of web
structure mining which harnesses the hyperlink structure of websites or the
social links between users. Typically, it starts with some seed data as input and
then uses the hyperlinks of websites (or social links between users) to find related
content. Spiders with appropriate filtering parameters and proxies are employed
to help collect the documents while taking adversarial conditions into account
[Zhou et al. 2007]. Other processing steps like duplicate content removal and collection
update procedures are performed to maintain the integrity and consistency of the data.
Link Based Bootstrapping (LBB) techniques for online radicalization detection are
discussed in Section 4.1.
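
The seed-and-expand idea behind Link Based Bootstrapping can be illustrated with a minimal sketch. This is not the algorithm of any surveyed paper: the function name `link_based_bootstrap`, the callbacks `get_links` and `is_relevant`, and the toy link graph are all hypothetical, and a real spider would fetch pages over the network and apply a much richer relevance filter.

```python
from collections import deque

def link_based_bootstrap(seeds, get_links, is_relevant, max_pages=1000):
    """Expand a seed set by following links, keeping only relevant pages.

    seeds       -- initial page identifiers (assumed to be known radical sites)
    get_links   -- hypothetical callback: page -> iterable of linked pages
    is_relevant -- hypothetical callback: page -> bool (the content filter)
    """
    collected = []          # relevant pages, in discovery order
    seen = set(seeds)       # pages already queued (duplicate removal)
    queue = deque(seeds)
    while queue and len(collected) < max_pages:
        page = queue.popleft()
        if not is_relevant(page):
            continue        # do not expand through off-topic pages
        collected.append(page)
        for target in get_links(page):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return collected

# Toy in-memory "web" (made-up data): page -> outgoing links.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["d"],
    "c": [],
    "d": ["e"],
    "e": [],
}
TOY_RELEVANT = {"seed", "a", "c", "d", "e"}   # "b" is treated as off-topic

found = link_based_bootstrap(
    ["seed"],
    get_links=lambda page: TOY_WEB.get(page, []),
    is_relevant=lambda page: page in TOY_RELEVANT,
)
print(found)
```

In this toy run the pages "d" and "e" are never collected, because the only path to them passes through the off-topic page "b". This shows one design trade-off of a strict filter: it keeps the crawl focused but can miss relevant content that is only reachable through irrelevant pages.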

Text classification (also known as text categorization) is the task of arranging
text documents into pre-defined classes or categories. Machine learning techniques
are widely used for text classification [Sebastiani 2002]. Text classification tasks
follow a typical pipeline which includes: manual labeling of documents into classes,
document representation, training a classifier on seen data and evaluation on an
unseen test set. Documents are represented as feature vectors and these feature vectors
are provided as input to train the classifier. There are various feature
representations which take into consideration the content and structure of the document, if
available. The simplest feature representation consists of treating each term in the
document as a feature, also called Bag-of-Words (BOW). Other feature
representations include tf-idf, bigrams, Part-of-Speech (POS) tags and syntax trees. Various
types of classifiers like Naive Bayes, Support Vector Machines (SVMs) and Decision
Trees have been used for classification [Sebastiani 2002]. Text classification
techniques are generally evaluated on four parameters: Precision, Recall, F-Measure
and Accuracy. The problem of detecting online radicalization can be treated as a
binary (also called 2-class) text classification problem. In the surveyed work, document
representations include graphs and word level, character level, content, syntactic and
lexical features. Machine learning classifiers like Naive Bayes, C4.5 and SVM are trained
on the feature representations. Appropriate evaluation metrics like Precision, Recall,
F-measure and Accuracy are reported to measure the performance of the classification.
Text classification techniques to detect online radicalization are discussed in Section
4.2.
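
As an illustration of the pipeline described above (labeling, Bag-of-Words representation, training, prediction), the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is a generic textbook formulation rather than the method of any surveyed paper; the function names, the class labels "flagged" and "benign", and the four placeholder training sentences are assumptions made purely for the example.

```python
import math
from collections import Counter

def tokenize(text):
    """Bag-of-Words tokenization: lowercase and split on whitespace."""
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, label) pairs. Returns the model counts."""
    class_docs = Counter()   # number of training documents per class
    word_counts = {}         # class label -> Counter of term frequencies
    vocab = set()
    for text, label in labeled_docs:
        class_docs[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for token in tokenize(text):
            counts[token] += 1
            vocab.add(token)
    return class_docs, word_counts, vocab

def predict(model, text):
    """Return the class with the highest log-posterior for the given text."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)      # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue                                      # skip unseen terms
            # Add-one (Laplace) smoothed term likelihood.
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Placeholder training data (made up for illustration only).
TRAIN = [
    ("hostile slogan against group", "flagged"),
    ("hostile call to violence", "flagged"),
    ("recipe for apple pie", "benign"),
    ("weather report sunny day", "benign"),
]
model = train_naive_bayes(TRAIN)
print(predict(model, "hostile slogan"))     # flagged
print(predict(model, "apple pie recipe"))   # benign
```

A real system would be trained on a manually labeled corpus and evaluated on a held-out test set; the sketch omits the evaluation step.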

LBB and text classification are not the only techniques used to detect online 
radicalization. A few approaches like multi-agent based methods have also been 
used. We classify these methods into the other category. Section 4.3 discusses 
these specific methods in detail. 

3.1.2 Modalities. The Internet consists of various forms of communication
modalities like websites, weblogs (also commonly known as blogs), social media (Twitter,
Facebook, MySpace), image sharing websites (Flickr, Photobucket), video sharing
websites (YouTube, Dailymotion) and online forums. Various terrorist organizations
have created websites on the Internet [Gerstenfeld et al. 2003] [Weimann 2004].
These websites help the organizations in carrying out activities like fundraising,
propaganda and recruiting [Lee and Leets 2002]. Blogs (or weblogs) are the online
equivalent of diaries. Blogs are an effective and powerful medium to communicate
ideas and reach out to other like-minded users. Hence, they are a befitting medium for
hate propaganda and sharing ideology [Chau and Xu 2007]. Online forums (or
message boards) are websites designed for users to post messages, news and other
content. Forums foster discussions on specific topics, increase collaboration and
help organized sharing of content. In addition, there are various freely available
software packages to create and host online forums. Most forums also require users to
sign up, which keeps their content from being detected by search engines. These
characteristics of online forums appeal to terrorist organizations seeking to organize
and spread information. Studies
show that many forums on the web are racist in nature [Reid et al. 2005]. Social
media like Facebook and Twitter have witnessed a meteoric rise in the recent past,
attracting a large user base. For example, Facebook has more than 800 million
active users and is used in more than 70 languages.⁶ Due to the enormous growth of
social media, hate promoting users have turned to these sites to promulgate hate
content and actively seek to connect with individuals.

3.1.3 Data Type. The content on the web consists of various data types like
text, images, audio and video. Terrorist organizations utilize all these types to
distribute material. Radicalization detection techniques aim to detect content across
this variety of data types. Multimedia processing is relatively more expensive than text
processing. Hate promoting videos are often of poor quality, which makes the use
of multimedia content difficult. Hence, most techniques rely on textual content or
user generated content surrounding the multimedia.

3.1.4 Features. Various types of features are used to assist techniques to detect 
online radicalization. The two commonly used features in the literature space of 
online radicalization are link based and content based features. 

Link based features analyze the structure between documents to make decisions
on detection of radical content. Link features include hyperlinks between web sites,
replies on common threads in online forums and relationships between users on
social network websites.

Content based features leverage the structure and content of a document.
These include lexical (frequency of letters, average word length etc.), syntactic
(frequency of punctuation, function words etc.), domain specific, graph based
and structural features (HTML markup, paragraphs etc.). Link based features are
mainly used in LBB techniques while content based features are heavily used in
text classification techniques. A few techniques use a combination of both link
based and content based features.
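
To make the lexical and syntactic feature families above concrete, here is a small sketch that turns a text into a numeric feature dictionary. The specific features, the name `content_features`, and the short `FUNCTION_WORDS` list are illustrative assumptions, not the feature set of any surveyed paper.

```python
import string

# Small illustrative subset of English function words (an assumption).
FUNCTION_WORDS = {"the", "of", "and", "to", "in", "a", "is", "that"}

def content_features(text):
    """Compute a few lexical and syntactic features of a text."""
    # Strip surrounding punctuation from each whitespace-separated token.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    n_chars = len(text)
    n_words = len(words)
    letters = sum(ch.isalpha() for ch in text)
    punctuation = sum(ch in string.punctuation for ch in text)
    function_words = sum(w in FUNCTION_WORDS for w in words)
    return {
        # Lexical features
        "letter_ratio": letters / n_chars if n_chars else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        # Syntactic features
        "punctuation_ratio": punctuation / n_chars if n_chars else 0.0,
        "function_word_ratio": function_words / n_words if n_words else 0.0,
    }

features = content_features("The cat is in the hat.")
print(features)
```

A dictionary like this can be flattened into a feature vector and passed to a classifier alongside, or instead of, a Bag-of-Words representation.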

3.1.5 Evaluation. Precision, Recall, F-Measure and Accuracy are widely used
and accepted evaluation metrics. Precision measures the "exactness" of the
technique while Recall measures its "completeness". Accuracy, as the name suggests,
evaluates the "correctness" of the solution. F-measure is the weighted harmonic
mean of precision and recall, essentially combining the two metrics into a single unit
of measurement. Using the counts in Table III, we formally define the evaluation
metrics as follows:

\[ \text{Precision} = \frac{TP}{TP + FP} \]

\[ \text{Recall} = \frac{TP}{TP + FN} \]

\[ \text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

⁶https://www.facebook.com/press/info.php?statistics
Table III. Evaluation data

                               Actual Class
                               True                   False
Predicted Class    Positive    True Positive (TP)     False Positive (FP)
                   Negative    False Negative (FN)    True Negative (TN)



\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \]

An increase in precision signifies fewer false positives while an increase in recall
signifies fewer false negatives.
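
The four metrics of this subsection can be computed directly from the TP, FP, FN and TN counts. The sketch below is a plain transcription of the standard definitions; the function name `evaluation_metrics` and the example counts are made up for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the survey.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Compute Precision, Recall, F-Measure and Accuracy from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }

# Hypothetical confusion counts for a detector run on 100 documents.
metrics = evaluation_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

With these example counts, precision is 40/50 = 0.8, recall is 40/60 (about 0.667), F-measure is about 0.727 and accuracy is 70/100 = 0.7.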

3.1.6 Output. There are three categories of output which are presented: users,
content and communities. Users are simply Internet users on the web, online
forums and social media who identify themselves by a screen name or a
pseudonym. Content refers to the type of documents and includes text and
multimedia (images, video and audio). A common property observed amongst most
structured networks like the WWW, social networks and collaboration networks is
the formation of communities. A community can be loosely defined as a collection
of individuals with common interests or ideas. Analysis of communities and
their interactions can reveal important insights like central leaders, structure and
information flow.

3.1.7 Language. Racist and hate promoting content is prevalent in multiple
languages. Different languages create associated issues, especially for radicalization
detection techniques which utilize content. For example, Arabic text is written from
right to left, leading to a fundamental change in the way document representations
are formed from Arabic text. There may also be multi-lingual content (both Arabic and
English) present in a document. This leads to rethinking of methods to extract
structural and syntactic features.

3.1.8 Genre. Various genres of xenophobia exist on the Internet. Some of
them include US domestic extremism, Middle East extremism, Anti-Semitism, hate
against Blacks and anti-India hate. We also classify our articles into one of these
genres.

3.2 Online Radicalization Analysis : Facets 

In this part of the survey, we identify literature which contributes towards analyzing 
online radicalization either as their main goal or a part of their goals. We search 
for literature whose main aim is to analyze extremist content on the Internet. For 
example, we collect papers whose primary objective is to analyze extremist websites 
or hate promoting online forums. Concretely, the algorithms of papers in this 
section of the survey receive extremist (hate promoting or racist) content as input 
and the algorithms proceed to analyze extremist content on various parameters. 
For example, extremist videos on YouTube can be analyzed to determine users 
posting such extremist videos and the communities these users form. Moreover, 
social network analysis can also determine leaders in these communities. Once we 
identify the literature, we proceed to analyze each paper and create a list of facets
to create a structured research map. Each facet has a list of properties associated 
with it. The facets are as follows : 

* Types of Analysis : What are the types of analysis used to examine radical
content on the web?

* Link Analysis : What types of network analysis are performed?

* Content Analysis : What types of content analysis are performed?

* Modalities : Which modalities on the web do they consider? 

* Language : What languages do these techniques address? 

* Genre : What genre of hate is being detected? 

These facets and their properties, along with brief descriptions, are listed in Table
VI.

3.2.1 Type of Analysis. There are various types of analysis performed on on- 
line radical content to gain insights and help understand the nature of the posted 
content. Extremist websites often link to each other and gather together to form a 
peculiar structure. It's important to analyze this structure to understand how these 
websites connect with each other and study their interactions. Web link analysis 
(or Web link mining) refers to the process of modeling and analyzing links between 
websites [Chakrabarti et al. 1999]. Link analysis can help discover communities, 
find central leaders and characterize various network properties. Content analysis,
on the other hand, analyzes the content of the websites. Linguistic patterns and
textual clues can help in understanding the nature of these websites and the reason
behind their formation. They can also throw light on the richness of
the content and hence indicate the technical sophistication of extremist websites.
One can also gauge the affect (or emotion) towards various topics by analyzing
the content. Moreover, these patterns can reveal the objectives for which
these websites are created and the manner in which they are being used. In some
scenarios, performing link analysis or content analysis individually is insufficient.
Hence, a combination of both link and content analysis is used in such scenarios.

3.2.2 Network Based Analysis. Network based analysis (also called Link Anal- 
ysis) is used to understand the structure of links between hate promoting websites. 
The existence of a link between two nodes can be due to a hyperlink, friend re- 
lationship or due to some kind of interaction like a reply to e-mail or bulletin 
message. Link analysis helps examine the topology of the network and characterize 
the properties of the network. Average shortest path length, clustering coefficient 
and degree distribution are common metrics used to reveal properties of a network. 
Average shortest path length is the mean of the shortest path lengths between all
pairs of nodes in a network. It measures how closely any two nodes in a network are
connected. Clustering coefficient measures the probability that two neighbors of a
node are themselves linked, i.e., how tightly nodes group together. In-degree measures
the number of links into a node while out-degree measures the number of links flowing
out from a node. Degree distribution is the probability distribution of node degrees
over a network. A network with a high clustering coefficient and a short average path
length is termed a small-world network.
A network whose degree distribution follows a power law is termed a scale-free
network [Wasserman et al. 1994].
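
As a minimal illustration of the three metrics above, the following Python sketch computes them on a small undirected toy network. The graph, the function names and the use of the average local clustering coefficient are assumptions made for this example; they are not taken from any of the surveyed papers.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Hop distance from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_path_length(adj):
    """Mean shortest-path length over all ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def clustering_coefficient(adj):
    """Average over nodes of the fraction of neighbor pairs that are linked."""
    scores = []
    for u in adj:
        nbrs = list(adj[u])
        k = len(nbrs)
        if k < 2:
            scores.append(0.0)
            continue
        links = sum(
            1
            for i in range(k)
            for j in range(i + 1, k)
            if nbrs[j] in adj[nbrs[i]]
        )
        scores.append(2.0 * links / (k * (k - 1)))
    return sum(scores) / len(scores)


def degree_distribution(adj):
    """Fraction of nodes having each degree."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}


# Hypothetical toy network: a triangle a-b-c with a pendant node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(average_shortest_path_length(toy))
print(clustering_coefficient(toy))
print(degree_distribution(toy))
```

On real extremist networks these quantities would normally be computed on the giant component only, since path lengths are undefined between disconnected nodes.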

An important property of every social network is the ability of nodes in the
network to group together as communities. A community can be defined as a group
of nodes with a common interest or topic. The nodes which form a community
possess more intra-community links than inter-community links.
Various community detection algorithms have been proposed to identify the
formation of communities [Girvan and Newman 2001] [Newman and Girvan 2004]
[Newman 2004]. Such community detection algorithms help in understanding
the underlying community structure in extremist websites, forums and blogs.
The interactions between these communities are also of great interest, as they help
in understanding information flow. Blockmodeling can be used to understand the
interactions between communities [Wasserman et al. 1994]. Visualization of communities
is also an important issue as this information is consumed by elementary computer users.
Various visualization methods like Multi Dimensional Scaling (MDS) and Snowflake
Visualization have been used to understand the formation of communities [Freeman
2000].
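
The notion of "more intra-community links than inter-community links" can be quantified with the modularity score used by several community detection algorithms. The sketch below is a hedged illustration only: the edge list and the two candidate partitions are invented, and the function evaluates a given partition rather than searching for the best one.

```python
def modularity(edges, community_of):
    """Modularity Q of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 m))^2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the nodes in c.
    """
    m = len(edges)
    intra_edges = {}  # community -> edges with both endpoints inside it
    degree_sum = {}   # community -> sum of degrees of its members
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra_edges[cu] = intra_edges.get(cu, 0) + 1
    return sum(
        intra_edges.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical network: two triangles joined by a single bridge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

print(modularity(edges, two_groups))  # higher: matches the two triangles
print(modularity(edges, one_group))   # zero: no community structure claimed
```

A community detection algorithm would search over partitions for a high score; here the score is only used to compare two hand-made partitions.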

A subset of nodes in a network, called central nodes, are key to the existence
and structure of the network. Central nodes generally act as nodes which link to
most parts of the network or form "bridges" between communities. In order to
identify central nodes in a network, three well studied measures are used : degree,
betweenness and closeness [C. and Freeman 1979]. Degree measures the number of
links incident on a node and thus is a measure of the node's popularity. The number
of shortest paths passing through a node in a network gives its "betweenness"
measure. The inverse of the sum of the lengths of the shortest paths between one node
and all other nodes in a network gives its "closeness" measure. Nodes with high
betweenness are known to act as links between communities while nodes with high
closeness are well connected to most parts of the network. The removal of central
nodes from a network can significantly alter the structure of the network.
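
The three centrality measures can be sketched in a few lines of Python. This is an illustrative sketch on an invented, connected, undirected toy graph; betweenness is computed with Brandes' accumulation scheme, and closeness is normalized as (n - 1) divided by the sum of distances, which is one common convention and an assumption here.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def degree_centrality(adj):
    """Number of links incident on each node."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def closeness_centrality(adj):
    """(reachable nodes) / (sum of shortest-path lengths to them)."""
    result = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total else 0.0
    return result


def betweenness_centrality(adj):
    """Brandes' algorithm for unweighted, undirected graphs."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                        # nodes in order of discovery
        preds = {v: [] for v in adj}      # predecessors on shortest paths
        sigma = {v: 0 for v in adj}       # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was counted from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical network: two triangles joined by the bridge c-d.
graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
print(degree_centrality(graph))
print(closeness_centrality(graph))
print(betweenness_centrality(graph))
```

In this toy graph the two bridge nodes c and d obtain the highest betweenness, which matches the intuition that such nodes link communities together.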

3.2.3 Content Based Analysis. Content based analysis is used to investigate
the content posted by extremist groups on the Internet. Affect analysis is
used to understand the affect (or emotion) expressed in text towards topics. Affect
analysis can help gauge the intensity of violence, racism and hatred in extremist content.
Authorship analysis is used to identify the author of a written piece of text based
on the author's stylistic cues. It can be used to detect Internet users in adversarial
settings who post content via anonymous accounts. Extremist groups maintain a
presence on the Internet to achieve a specific aim or objective. Some of these
objectives include communication, fundraising, sharing ideology, propaganda and
community formation. Content analysis can examine the content to understand the
aim behind the presence of extremist groups on the Internet. Content analysis can
also help examine the behavioral patterns in the usage of these extremist websites.
This can help understand the level of technical sophistication and interactivity
offered by the owners of these websites. Other information
like IP addresses, message headers in forums and comments in blogs can help gain
insights into the way extremist users use these websites and interact with each other.
Content analysis can also investigate the topics talked about by users and find virtual
topic based communities.

3.2.4 Modalities. Various modalities like websites, blogs and forums are used to
host and spread information on the Internet. Each of these modalities provides
unique characteristics. For example, blogs provide an explicit option to add friends
in the form of subscriptions. This information can be used to construct networks for
further analysis. However, these modalities also bring a set of associated issues. For
example, websites and forums may contain different types of content (text and/or
multimedia), making them difficult to analyze. Also, certain
media like forums may not contain explicit links between users and forums even
though such relationships exist.

3.2.5 Language. Language is an important facet similar to 3.1.7. Language 
particularly creates problems for content based analysis techniques especially when 
multi-lingual content is analyzed. 

3.2.6 Genre. Different genres of extremism are considered as a facet similar to
3.1.8. This facet gives an understanding of which genres of extremism are analyzed.

4. SURVEY OF AUTOMATED SOLUTIONS FOR ONLINE RADICALIZATION DETECTION

In this section, we perform a survey of the techniques used to automatically detect 
online radicalization. We present a summary of these solutions in Table V. 

4.1 Link Based Bootstrapping (LBB) Techniques for Detecting Online Radicalization 

Link Based Bootstrapping (LBB) techniques are amongst the most popular
techniques used to detect online radical content [Reid et al. 2005] [Zhou et al.
2005] [Qin et al. 2007] [Zhou et al. 2007] [Chen et al. 2008] [Chen et al. 2008] [Qin
et al. 2011]. These techniques use a semi-automated approach to detect radical
content on various modalities on the Internet including websites, blogs and online
forums. First, a set of seed URLs is identified from authoritative sources. The
URLs are then expanded using back link search and favorite links to accumulate
related URLs. The idea is that extremist websites (or forums) link to each other
and form some sort of a community structure. The expanded set is once again
manually filtered by domain experts in order to avoid collecting off-topic pages.
Web crawlers are used to download and collect content. Extremist forums may
not be indexed by modern search engines, forming a part of the web named the
"Hidden Web". Not only do these forums consist of rich multimedia content, but they
also pose accessibility issues like membership requirements and adversarial detection.
Hence, a semi-automated focused incremental crawler in conjunction with a
recall-improvement based incremental update procedure is used to collect extremist
forums on the "Hidden Web" [Fu et al. 2010]. A focused crawler seeks, acquires, indexes,
and maintains pages on a specific set of topics that represent a relatively narrow segment
of the web [Chakrabarti et al. 1999]. Forums which do not allow anonymous access
require human expertise to apply to the webmaster for membership. Appropriate
spidering parameters, proxies, URL ordering techniques and wrapper parameters
are chosen to ensure that the maximum amount of data is indexed while also supporting
incremental updates. Duplicate content removal techniques are used to avoid multiple indexing.
This approach yields better results than standard spidering in conjunction with
periodic or incremental update procedures. The Dark Web portal project heavily 
employs LBB techniques to collect extremist data [Reid et al. 2004] [Qin et al. 
2005]. This data collection includes multiple modalities (like websites and online 
forums) and various genres of hate like jihadist and US domestic hate.^ 

Sureka et al. also use an LBB technique to identify extremist videos, users and
implicit communities on YouTube [Sureka et al. 2010]. In particular, they manually
identify a set of seed videos and then exploit social relationships like friends,
comments, subscriptions, favorites and playlists to expand the initial seed data.
In order to avoid off-topic videos, conservative expansion is performed based on
in-degree, hub, information, out-degree and betweenness centrality. Starting from
a seed of 60 users, they were able to add 98 new users with 88% average precision.
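
The general bootstrapping loop described in this section (seed, expand along links, filter) can be sketched as follows. This is a simplified illustration and not the procedure of any specific surveyed system: the link graph is invented, `min_in_links` is a hypothetical threshold standing in for conservative expansion, and the `is_relevant` callback stands in for the manual filtering performed by domain experts.

```python
def bootstrap(out_links, seeds, min_in_links=2, rounds=2,
              is_relevant=lambda node: True):
    """Expand a seed set along links.

    A candidate is accepted in a round only if at least `min_in_links`
    already-collected nodes link to it and `is_relevant` approves it.
    """
    collected = set(seeds)
    for _ in range(rounds):
        votes = {}  # candidate -> number of collected nodes linking to it
        for node in collected:
            for target in out_links.get(node, ()):
                if target not in collected:
                    votes[target] = votes.get(target, 0) + 1
        accepted = {
            target for target, count in votes.items()
            if count >= min_in_links and is_relevant(target)
        }
        if not accepted:
            break  # nothing new passed the filter; stop early
        collected |= accepted
    return collected


# Hypothetical link graph: s1 and s2 are expert-chosen seeds.
out_links = {
    "s1": {"x", "y"},
    "s2": {"x", "z"},
    "x": {"w"},
    "y": {"w"},
    "z": set(),
}
print(sorted(bootstrap(out_links, {"s1", "s2"})))                  # conservative
print(sorted(bootstrap(out_links, {"s1", "s2"}, min_in_links=1)))  # permissive
```

Raising the threshold trades recall for precision: with the conservative setting only the node linked from both seeds is added, while the permissive setting snowballs to the whole toy graph.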

4.2 Text Classification Techniques for Detecting Online Radicalization 

Greevy and Smeaton treat the problem of detecting racist texts as a binary classifica-
tion problem [Greevy and Smeaton 2004]. They train a machine learning classifier,
Support Vector Machine (SVM), and use Bag-of-Words (BOW), Bigram and
Part-of-Speech (POS) feature representations. They report the best results for an SVM
with a polynomial kernel function and BOW features, which outperforms the other
combinations.
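
To make the feature representations concrete, the sketch below builds BOW and bigram count vectors for short texts and evaluates a polynomial kernel between two BOW vectors. It is an illustration under stated assumptions: the tokenizer, the kernel form (x·y + coef0)^degree and its parameter values are generic choices, not the settings reported by Greevy and Smeaton, and neither POS tagging nor SVM training is shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bow_features(text):
    """Bag-of-Words: term -> count."""
    return Counter(tokenize(text))


def bigram_features(text):
    """Bigrams: (term, next term) -> count."""
    tokens = tokenize(text)
    return Counter(zip(tokens, tokens[1:]))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """(x . y + coef0) ** degree on sparse count vectors."""
    dot = sum(count * y.get(term, 0) for term, count in x.items())
    return (dot + coef0) ** degree


doc_a = "the cat sat"
doc_b = "the cat ran"
print(bow_features(doc_a))
print(bigram_features(doc_a))
print(polynomial_kernel(bow_features(doc_a), bow_features(doc_b)))
```

An SVM would use such kernel values between all pairs of training documents; the two documents here share two terms, so their dot product is 2.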

Last et al. propose a machine learning approach based on a graph representation
of web documents to classify multi-lingual terrorist content [Last et al. 2006]. They
exploit the structure of the web page (title, anchor text and body) to create a graph
based representation of the documents. They use a sub-graph extraction algorithm
in conjunction with a sub-graph frequency-inverse sub-graph frequency metric to
build the final representation. They use a C4.5 classifier to classify 648 Arabic
web documents as terrorist or non-terrorist.

Fu et al. and Huang et al. use a machine learning approach on user-generated
content to classify extremist videos on video sharing websites [Fu et al. 2009]
[Huang et al. 2010]. They create a testbed of 224 positive and negative videos on
YouTube, manually tagged by extremism experts, to evaluate their approach. The
user-generated content includes description, title, author names, names of other
uploaded videos, comments, categories and tags. They use four
feature sets : lexical (character-based, word-based), syntactic (frequency of function
words and punctuation), content-specific (word and character n-grams,
video tags and categories) and a feature set obtained through a feature selection
strategy (using Information Gain). They use three classifiers, C4.5, Naive Bayes
and SVM, to train and test their model. They report the best accuracy for the
feature selection strategy document representation with SVM. Lexical and syntactic
features are reported as the key discriminating features.
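
Information Gain based feature selection can be illustrated with a small Python sketch. The four labeled documents and the vocabulary are invented for this example, and features are treated as binary (term present or absent); the surveyed work may have computed the score differently.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Reduction in class entropy obtained by knowing whether `term` occurs."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def select_top_k(docs, labels, k):
    """Rank every term in the collection by Information Gain and keep k."""
    vocabulary = set().union(*docs)
    ranked = sorted(vocabulary,
                    key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical collection: each document is a set of terms; 1 = extremist.
docs = [
    {"attack", "video", "the"},
    {"attack", "video"},
    {"music", "video", "the"},
    {"music", "video"},
]
labels = [1, 1, 0, 0]
print(information_gain(docs, labels, "attack"))  # term separates the classes
print(information_gain(docs, labels, "video"))   # term occurs everywhere
print(select_top_k(docs, labels, 2))
```

Terms that occur in every document, or equally often in both classes, receive a score of zero and are dropped first.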

4.3 Other Techniques for Detecting Online Radicalization 

There are other techniques explored for detecting online radicalization as well.
Aknine et al. treat the problem of racist text classification as a 3-class problem :
racist, anti-racist and neutral. They use a multi-agent based approach to detect
racist documents on the Internet [Aknine et al. 2005]. They use three agents :
query agents (to query the web), document agents (to fetch documents) and criteria
agents (for linguistic features) in a pyramidal co-ordination framework. They
evaluate their system on English, French and German languages.

^ http://128.196.40.222:8080/CRI_Indexed_new/login.jsp
IIITD PhD Comprehensive Report, Vol. V, No. N, January 2013.

5. SURVEY OF SOLUTIONS FOR ONLINE RADICALIZATION ANALYSIS 

In this section, we perform a survey of the techniques used to analyze online radi- 
calization. We present a summary of these solutions in Table VII. 

5.1 Network Based Analysis for Online Radical Content 

Link based analysis exploits the structure of links between nodes in a network 
to analyze the topological characteristics of a network. Websites on the Internet 
contain hyperlinks which refer to external as well as internal web pages. These 
hyperlinks can be analyzed to gain further insight into the network's topological 
characteristics. 

Various approaches in literature analyze such a hyperlink structure between
extremist websites [Reid et al. 2005] [Zhou et al. 2005] [Chen 2007] [Chen et al. 2008]
[Zhang et al. 2009]. Typically, the entire extremist network is treated as a graph
where each extremist website is a node and the edges are hyperlinks. One of the
main objectives of exploring links between websites is to detect implicit communities
and visualize them. In order to achieve that objective, each link between the
nodes is assigned a weight depending on the page level in the site hierarchy. This
graph is then fed to a visualization algorithm, Multi-Dimensional Scaling (MDS).
MDS arranges the nodes in the network such that the most similar nodes are close
to each other while dissimilar nodes are far from each other. Many studies
use MDS along with the page level similarity metric to discover hyperlinked
extremist communities [Reid et al. 2005] [Zhou et al. 2005] [Chen 2007] [Chen et al.
2008]. In the case of blog sharing websites, each node is treated as a user and an edge
is either a subscription or a group co-membership (belonging to the same blog ring)
relationship [Chau and Xu 2007]. Similarly, in the case of forums each user is marked
as a node and a reply/message on a thread (interaction/activity) is treated as a
link between the two users. In addition to finding communities, it is also important
to find central players in these communities. Zhou et al. find central nodes in a
collection of US domestic hate groups [Zhou et al. 2005]. Chau et al. use betweenness
and degree measures to discover central nodes in White Supremacist blogs in the
US [Chau and Xu 2007]. Central nodes help structure other nodes and facilitate
information flow in a network. These central nodes can be vital breakdown points
in order to debilitate such networks.

Apart from analyzing the link structure in a network, it is also important to
investigate the topological characteristics of networks. Average path length,
clustering coefficient and degree distribution are a few metrics widely used to explain
network topologies. These metrics are usually calculated on the giant component
in a network. The giant component is the largest connected (not disjoint) component
in a graph and hence a representative subset of the network under study. Chau et
al. analyze the topological characteristics of the giant component in anti-Black
extremist blog rings on Xanga, a popular blog sharing website [Chau and Xu 2007].
Xu et al. analyze the structural characteristics of Middle-Eastern, Latin American
and US domestic extremist websites, the Global Salafi Jihad (terrorist) network and
the Meth World network (illegal drug trafficking network) [Xu et al. 2006] [Xu and Chen
2008]. In all the above studies, network characteristics like average path length,
clustering coefficient and degree distribution were calculated. The degree distribution
follows a power law and highlights preferential attachment in these
networks, where new users are attracted to popular nodes. Such networks are called
scale-free networks and depict the "rich get richer" phenomenon. The average path
length observed is small, indicating that these networks are relatively small-world. This
shows that a node can connect with any other node in very few hops (five or less). A
high clustering coefficient shows that there are dense local links leading to optimal
efficiency in communication between nodes.
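
Since the topological metrics are usually computed on the giant component, extracting it is typically the first step of such an analysis. The following sketch shows one way to do this with a breadth-first search; the toy network and function names are assumptions made for illustration.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an undirected graph as sets of nodes."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            component.add(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components


def giant_component(adj):
    """Subgraph induced by the largest connected component."""
    largest = max(connected_components(adj), key=len)
    return {u: adj[u] & largest for u in largest}


# Hypothetical network with three components of sizes 3, 2 and 1.
network = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "x": {"y"}, "y": {"x"},
    "z": set(),
}
giant = giant_component(network)
print(sorted(giant))
```

Metrics such as average path length can then be computed on `giant` without running into undefined distances between disconnected nodes.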

In some cases, just studying the topology of the network is not enough. A
combination of both network and content is considered desirable to enable a deep
understanding of the network under study. L'Huillier et al. propose a topic centered social
network analysis approach to discover virtual communities in online extremist
forums [L'Huillier et al. 2010]. They first create a network with links which show
interaction between members in the forums. A link between two members signifies
interaction between the members, like posting a message on a common thread.
Members are then filtered according to topics to find topic centered virtual
communities. These topics are extracted using Latent Dirichlet Allocation (LDA), a
topic modeling algorithm. The HITS algorithm is employed to find information hubs
and authoritative members in the network.
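
The HITS algorithm assigns each node a hub score and an authority score through repeated mutual reinforcement. The sketch below is a generic power-iteration version on an invented reply network; it is not the implementation used by L'Huillier et al., and the direction given to links (from the replying member to the member replied to) is an assumption for this example.

```python
import math


def hits(out_links, iterations=50):
    """Return (hub, authority) scores computed by power iteration."""
    nodes = set(out_links) | {v for targets in out_links.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority: sum of hub scores of the nodes linking to it.
        auth = {n: 0.0 for n in nodes}
        for u, targets in out_links.items():
            for v in targets:
                auth[v] += hub[u]
        norm = math.sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub: sum of authority scores of the nodes it links to.
        hub = {n: sum(auth[v] for v in out_links.get(n, ())) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth


# Hypothetical forum: u1, u2 and u3 reply to member m; u1 also replies to n.
replies = {"u1": {"m", "n"}, "u2": {"m"}, "u3": {"m"}}
hub, auth = hits(replies)
print(max(auth, key=auth.get))  # member receiving the most attention
print(max(hub, key=hub.get))    # member pointing at the most authoritative ones
```

In a topic centered analysis, the same computation would be run on the sub-network of members filtered for a given topic.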

5.2 Content Based Analysis for Online Radical Content 

Content based analysis investigates patterns in the content posted by extremist
websites and groups. These patterns can lead to insights into the functioning of these
extremist organizations and the purpose of the content posted by these groups.

Users in extremist forums may post hate promoting content under the condition
of anonymity. Authorship analysis can help attribute content to the actual author
in an adversarial setting. Authorship analysis profiles the stylistic variations in the
content posted by a user. Some studies apply authorship analysis to multi-lingual
(English and Arabic) messages in online extremist forums [Abbasi and Chen 2005]
[Chen 2007]. Authorship identification is treated as a multi-class text classification
problem where each class is an author. A combination of multiple features including
lexical (word length distribution, frequency of letters), syntactic (punctuation,
function words), structural (font size, font color) and content-specific (race, gender)
features is used to profile a user's content and build the document representation.
In order to address Arabic language specific issues like diacritics, inflection and
word elongation, an Arabic language parser is used. Two classifiers, C4.5 and SVM,
are used and feature sets are added incrementally. SVM with the combination of all
feature sets performs the best on both English and Arabic messages.
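
A few of the lexical and syntactic stylometric features mentioned above can be extracted as shown in the following sketch. The feature names, the short function-word list and the sample message are illustrative assumptions; structural and content-specific features, as well as Arabic-specific processing, are not covered.

```python
import string
from collections import Counter

# Illustrative subset only; real studies use much longer function-word lists.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is")


def stylometric_features(text):
    """Map a message to a dictionary of simple style features."""
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]
    n_words = len(words) or 1
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n_letters = len(letters) or 1

    features = {}
    # Lexical: word length distribution and letter frequencies.
    for length, count in Counter(len(w) for w in words).items():
        features[f"wordlen_{length}"] = count / n_words
    for letter, count in Counter(letters).items():
        features[f"letter_{letter}"] = count / n_letters
    # Syntactic: punctuation counts and function word frequencies.
    for mark in ",.!?;:":
        features[f"punct_{mark}"] = text.count(mark)
    for word in FUNCTION_WORDS:
        features[f"func_{word}"] = words.count(word) / n_words
    return features


sample = stylometric_features("The end, the start!")
print(sample["wordlen_3"], sample["func_the"], sample["punct_!"])
```

Each message becomes one such feature vector, labeled with its author, and a multi-class classifier is trained on these vectors.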

Extremist groups use and maintain websites for various purposes to achieve their
objective of xenophobia. A few studies manually classify extremist websites into
eight pre-defined categories based on their content [Reid et al. 2005] [Zhou et al.
2005] [Chen 2007] [Chen et al. 2008]. These categories include communications,
fundraising, sharing ideology, propaganda (insiders), propaganda (outsiders), virtual
community, command & control and recruitment & training. The strength (or
intensity) of these categories is displayed using Snowflake Visualization. Snowflake
visualization helps display the intensity of a variable across multiple dimensions
[Reid et al. 2005] [Chen et al. 2008] [Chau and Xu 2007]. These studies show that
most terrorist organizations primarily use the Internet to share their ideology.

Affect analysis measures the quantity of affect (or emotion) in given content.
Affect analysis can help understand the emotions of users towards various topics. A few
studies analyze the intensity of emotions in international extremist forums [Abbasi
2007] [Chen 2008]. Abbasi manually creates an affect lexicon using a probability
disambiguation technique [Abbasi 2007]. The affect lexicon is used to calculate
an affect intensity score for each class. Abbasi reports that the intensity of hate
and violence related affects is higher in Middle Eastern forums than in US domestic
extremist forums. Chen uses a machine learning classifier to analyze affects in two
jihadist web forums : Al Firdaws and Montada [Chen 2008]. Various linguistic
features like character n-grams, word n-grams, root n-grams and collocations are
extracted as indicators. A recursive feature elimination (RFE) technique in
conjunction with an Information Gain (IG) heuristic is used to reduce feature dimensions.
An SVR ensemble classifier is trained and evaluated on a manually created
test bed. Al Firdaws postings contained more violence, hate and racism related
affects than messages on the Montada forum.

A few studies analyze extremist websites in terms of their technical sophistication,
content richness and web interactivity [Qin et al. 2007] [Chen et al. 2008] [Qin
et al. 2011]. Technical sophistication attributes include embedded multimedia and
the use of dynamic web programming scripts. Content
richness attributes include hyperlinks, software and file downloads. Web interactivity
attributes include guest books, chat rooms, online shops and feedback forms.
Studies report that extremist websites make use of advanced communication methods
like email and chat to facilitate the flow of information. These studies also show
that extremist groups make more use of multimedia content (images, video) in order to
make a quick and lasting impression on the reader.

5.3 Other Solutions for Analyzing Online Radical Content 

There are some other types of analysis performed in extremist online forums. Some 
studies also analyze the content to discover important topics in extremist content 
[Yang et al. 2009]. A topic modeling algorithm like Latent Dirichlet Allocation 
(LDA) is used to model the documents as a collection of topics. Other studies map
the geolocations of the ISPs, cities and countries of Jihadist extremist websites [Mielke
and Chen 2008]. The mapping reveals that major extremist websites are hosted in
the US or in European countries. Kramer proposes to use Finite Lyapunov Time
Exponents (FLTE) to find anomalies in extremist forum postings [Kramer 2010].
FLTEs are considered good indicators of stability in dynamic systems and have
wide applications in control systems research. The idea is that the distribution of
words in an anomalous environment deviates from the usual distribution. Kramer
models the text posted in forums by each user as a time-series equation. The tf-idf
scores of the words posted by the user are used to form the equations. FLTEs are then
used to check the "change" in distributions, which if positive amounts to an anomaly.

6. DISCUSSION

In this section, we discuss the analysis of facets selected for paper classification with 
respect to the initial objectives of this literature survey. We revisit the solution 
space and synthesize the solutions to determine trends and limitations. 

6.1 Online Radicalization Detection

Figure 3 shows the timeline of literature in online radicalization detection. Each 
paper is depicted by a red square box accompanied by the author and the title of 
the paper. The distance from the axis is proportional to the number of citations 
currently received by the paper. 

(a) Techniques 

There are two major techniques used to detect (collect or search or find) online 
radical content : Link Based Bootstrapping (LBB) and Text Classification. 
LBB makes use of startup or seed data (identified by extremism experts) and
then exploits the links (typically hyperlinks or social links) in the data to snowball
the seed and obtain an expanded set. The irrelevant or off-topic data in the
expanded set is filtered out manually by domain experts. Text classification
on the other hand handles the problem of detecting online radical content as 
a binary text classification problem. It examines the linguistic features in the 
content and uses a machine learning classifier to make decisions. 

(b) Trends 

Figure 4 shows the trends observed in literature in the solution space of online
radicalization detection. We observe that LBB is the most popular
technique to detect online radicalization. These techniques exploit the
co-citation phenomenon where "like minded" groups link to each other. Hence,
most of these techniques use link based features rather than content based
features. We also notice that most of the literature detects extremist content
on websites. Relatively less attention is devoted to online forums and social
network websites. Amongst social media websites, solutions are targeted
towards YouTube, a popular video sharing website. We reason that this may be
due to the appealing nature of multimedia and its immediate impact. Other
popular online social media like Twitter, Facebook and Google+ are ignored.
Also, the Middle Eastern genre of extremism is the most studied, followed closely
by US domestic and Latin American hate.

(c) Limitations 

Both LBB and text classification techniques proposed in literature have their
limitations. LBB techniques are semi-automated and require tremendous manual
effort to filter out irrelevant content. Content on the Internet is very
ephemeral and its patterns tend to change over time. LBB based techniques find it
difficult to cope with such temporal and fleeting extremist content. Text
classification techniques, on the other hand, make some assumptions about the data.
The task of detecting radical content is treated as a binary classification
problem. However, the amount of radical content on the Internet is smaller than
the amount of non-radical content. Hence, the dataset used to evaluate text
classification should be an imbalanced dataset. This aspect is not
considered in solutions which use text classification to solve the problem of online
radicalization [Aknine et al. 2005] [Greevy and Smeaton 2004].
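
The effect of class imbalance on evaluation can be seen with a small worked example. The numbers below are hypothetical and chosen only for illustration: a collection of 1,000 pages of which 10 are radical, and a trivial classifier that labels every page as non-radical.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced test set: 10 radical pages, 990 non-radical pages.
# A classifier that predicts "non-radical" for everything:
trivial = evaluation_metrics(tp=0, fp=0, fn=10, tn=990)
print(trivial)

# A classifier that finds 8 of the 10 radical pages with 12 false alarms:
useful = evaluation_metrics(tp=8, fp=12, fn=2, tn=978)
print(useful)
```

The trivial classifier reaches 99% accuracy while detecting no radical page at all, which is why evaluation on balanced datasets, or with accuracy alone, can be misleading for this task.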

6.2 Online Radicalization Analysis 

Figure 5 shows the timeline of literature in the space of online radicalization anal- 
ysis. Each paper is depicted by a red square box accompanied by the author and 
the partial title of the paper (for better readability). The distance from the axis is 
proportional to the number of citations currently received by the paper. 

(a) Type of Analysis

Online radical content can be analyzed either in terms of the network formed
(structure of hyperlinks or social links) or by the information it contains. The
main goal of analyzing the content is to provide actionable information to law
enforcement agencies. Various network based (communities, leaders and
topological characteristics) and content based (authorship identification, website
activities, affect analysis, usage) analyses are performed on online radical
content.

(b) Trends 

Figure 6 shows the trends observed in literature in the solution space of online
radicalization analysis. Network based analysis is the most popular type of
analysis performed on online radical content. Most network analyses focus on
detecting communities and their characteristics, as this is the most pertinent
information to a law enforcement agency or security analyst. Also, the most
common modalities focused on for analysis are websites, closely followed by
online forums. We argue that this may be because Middle-Eastern extremist
groups (the most studied genre of extremism) are technically sophisticated enough to host
websites and online forums. These modalities allow these groups better control
over users, content and activity. These modalities also allow the existence of
various interactive methods like live broadcast radio and chat rooms.

(c) Limitations 

Different network based and content based techniques are used to analyze radical
content. Network based techniques throw light on the community structure,
key leaders and topological characteristics of the network. The techniques used
in the literature do not use formal community detection algorithms to detect
communities [Reid et al. 2005] [Chau and Xu 2007] [L'Huillier et al. 2010].
These techniques rely more on visual inspection using graph layouts, which can
be misleading and unreliable. Content based techniques analyze the structure
and type of content to understand the purpose of the extremist
website, like propaganda, fundraising or sharing ideology. Currently, the
techniques in literature perform this classification manually with the help of
extremism experts [Chen 2007] [Chen et al. 2008]. In practice, this may not be a
feasible solution for a security analyst.

7. RESEARCH GAPS 

Based on our classification scheme and the analysis of literature in the space of
Automated Online Radicalization Detection and Analysis, we identify a few research
gaps along the facets we considered for our paper classification scheme.

Fig. 3. Timeline of literature in Online Radicalization Detection. The red square
box denotes the number of citations received. The distance from the axis is
proportional to the number of citations received by each paper.

Fig. 4. Trend Analysis of literature in online radicalization detection. The
percentages show the number of papers included in our survey which contain the
corresponding dimension.

Fig. 5. Timeline of literature in the space of Online Radicalization Analysis. The red
square box denotes the number of citations received. The distance from the axis is
proportional to the number of citations received by each paper.

Fig. 6. Trend Analysis of literature in online radicalization analysis. The
percentages show the number of papers included in our survey which contain the
corresponding dimension.

(I) Modalities : Twitter, Facebook

Each modality (micro-blogging website, video-sharing website, photo-sharing
website, blogosphere, online chat, online forums or discussion threads) poses
unique technical challenges and hence a focused approach is required to
develop techniques for a specific modality. While there has been work done in
the area of video-sharing websites such as YouTube and the blogosphere,
micro-blogging websites (Twitter being the most popular example) are a modality
which is unexplored. We also notice that mining Facebook in the context
of automatic detection of online radicalization is an area which is relatively
unexplored.

(II) Online Radicalization Detection : Activity Based Detection 

Our survey reveals that content-based and link-based features have been used 
as effective signals to identify hate-promoting content. However, 
activity-based (usage-based) features have not yet been explored. We believe 
that applying activity-based features to the detection of hate-promoting 
content and users is an open research problem. 
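
As a rough illustration of what activity-based (usage-based) features might look like, the sketch below derives a few per-user statistics from posting timestamps alone. The input format, the feature names (`n_posts`, `mean_gap_hours`, `night_fraction`) and the choice of 00:00 to 05:59 as "night" are all assumptions made for this example; none of them come from the surveyed papers.

```python
from collections import defaultdict
from datetime import datetime


def activity_features(events):
    """Compute simple usage-based features per user.

    events: iterable of (user_id, ISO-8601 timestamp string) pairs.
    This input format is a hypothetical one chosen for illustration.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(datetime.fromisoformat(ts))

    features = {}
    for user, times in by_user.items():
        times.sort()
        # Gaps between consecutive posts, in hours.
        gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
        features[user] = {
            "n_posts": len(times),
            # Undefined for users with a single post.
            "mean_gap_hours": sum(gaps) / len(gaps) if gaps else None,
            # Share of posts made between 00:00 and 05:59 (arbitrary cut-off).
            "night_fraction": sum(t.hour < 6 for t in times) / len(times),
        }
    return features


sample_events = [
    ("u1", "2011-01-01T01:00:00"),
    ("u1", "2011-01-01T03:00:00"),
    ("u1", "2011-01-01T13:00:00"),
    ("u2", "2011-01-02T12:00:00"),
]
sample_features = activity_features(sample_events)
```

Feature vectors of this kind could, in principle, be fed to the same classifiers that the surveyed work applies to content-based and link-based features; whether they carry useful signal is exactly the open question raised above.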

(III) Online Radicalization Analysis : Community Detection 

Community formation is an important property of most social networks like 
WWW, Twitter and Facebook. Communities are therefore significant actionable 
information for law enforcement agencies. In the literature, we observe that 
community detection is performed by inspecting visual layouts. However, 
algorithms for community detection have been well studied and evaluated on 
real-world networks such as the WWW and collaboration networks. These 
algorithms have not been applied to extremist networks. Moreover, networks on 
social media such as Twitter are constantly evolving and temporal in nature. 
Therefore, both static and dynamic community detection algorithms need to be 
studied. 
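
To make the contrast with visual inspection concrete, the following is a minimal sketch of one algorithmic approach: greedy agglomerative merging that maximizes Newman's modularity. It is only one of many community detection algorithms and is not taken from any of the surveyed papers. The toy edge list and the function names are assumptions for illustration, and the quadratic search over merges is suitable only for very small graphs.

```python
from itertools import combinations


def modularity(edges, communities):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    m = len(edges)
    label = {node: i for i, group in enumerate(communities) for node in group}
    internal = [0] * len(communities)  # edges with both endpoints in the community
    degree = [0] * len(communities)    # summed degree of the community's nodes
    for u, v in edges:
        degree[label[u]] += 1
        degree[label[v]] += 1
        if label[u] == label[v]:
            internal[label[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))


def greedy_communities(edges):
    """Start from singletons; repeatedly apply the merge that raises Q the most."""
    nodes = sorted({n for e in edges for n in e})
    communities = [frozenset([n]) for n in nodes]
    best_q = modularity(edges, communities)
    while len(communities) > 1:
        best_merge = None
        for a, b in combinations(communities, 2):
            candidate = [c for c in communities if c not in (a, b)] + [a | b]
            q = modularity(edges, candidate)
            if q > best_q + 1e-12:  # small tolerance against float noise
                best_q, best_merge = q, candidate
        if best_merge is None:      # no merge improves Q, so stop
            break
        communities = best_merge
    return communities, best_q


# Toy graph (hypothetical): two triangles joined by a single bridge edge.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"),
             ("d", "e"), ("d", "f"), ("e", "f"),
             ("c", "d")]
toy_communities, toy_q = greedy_communities(toy_edges)
```

On this toy graph the procedure separates the two triangles. For an evolving network, one simple option would be to rerun such a procedure on each time-stamped snapshot of the edge list and compare the resulting partitions, although dedicated dynamic algorithms would be needed for anything beyond small examples.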

8. CONCLUSION 

Automated solutions to detect and counter cyber-crime related to the promotion 
of hate and radicalization on the Internet (to address the needs of 
law-enforcement and intelligence agencies) have recently attracted a lot of 
research attention. In this survey paper, we reviewed the state of the art in 
automated solutions for detecting online radicalization on the Internet and 
social media websites. We presented a novel multi-level taxonomy or framework 
to organize the existing literature and presented our perspective on the area. 
We categorized existing studies (based on the multi-level taxonomy) along 
various dimensions such as the social media website (YouTube, Twitter, 
Facebook), the technique employed (Machine Learning, Social Network Analysis) 
and the modality (Websites, Online Forums, Blogs, Social Media). We reviewed 
key results, compared and contrasted various approaches, and offered our 
perspective on trends based on the evidence derived from our collection of 
literature in the area. We also identified research gaps in this line of 
research, discussed unsolved problems and presented potential future work. 